Abstract

We explore multi-scale convolutional neural nets (CNNs) 
for image classification. Contemporary approaches extract 
features from a single output layer. By extracting features 
from multiple layers, one can simultaneously reason about 
high, mid, and low-level features during classification. The 
resulting multi-scale architecture can itself be seen as a 
feed-forward model that is structured as a directed acyclic 
graph (DAG-CNNs). We use DAG-CNNs to learn a set of 
multiscale features that can be effectively shared between 
coarse and fine-grained classification tasks. While
fine-tuning such models helps performance, we show that even
“off-the-shelf” multiscale features perform quite well. We
present extensive analysis and demonstrate state-of-the-art
classification performance on three standard scene
benchmarks (SUN397, MIT67, and Scene15). In terms of the
heavily benchmarked MIT67 and Scene15 datasets, our results
reduce the lowest previously-reported error by 23.9% and
9.5%, respectively.

1. Introduction 

Deep convolutional neural nets (CNNs), pioneered by
LeCun and collaborators [18], now produce state-of-the-art
performance on many visual recognition tasks [16, 29, 33].
An attractive property is that they appear to serve as
universal feature extractors, either as “off-the-shelf”
features or through a small amount of “fine-tuning”. CNNs
trained on particular tasks such as large-scale image
classification [5] transfer extraordinarily well to other
tasks such as object detection [11], scene recognition [39],
image retrieval [12], etc. [26].

Hierarchical chain models: CNNs are hierarchical
feed-forward architectures that compute progressively
invariant representations of the input image. However, the
appropriate level of invariance might be task-dependent.
Distinguishing people and dogs requires a representation
that is robust to large spatial deformations, since people
and dogs can articulate. However, fine-grained
categorization of car models (or bird species) requires
fine-scale features that capture subtle shape cues. We argue
that a universal architecture capable of both tasks must
employ some form of multi-scale features for output
prediction.

Figure 1. Recognition typically requires features at multiple scales.
Distinguishing a person versus a dog requires highly invariant
features robust to the deformation of each category. On the other
hand, fine-grained recognition likely requires detailed shape cues
to distinguish models of cars (top). We use these observations to
revisit deep convolutional neural net (CNN) architectures. Typical
approaches train a classifier using features from a single output
layer (left). We extract multi-scale features from multiple layers
to simultaneously distinguish coarse and fine classes. Such
features come “for free” since they are already computed during the
feed-forward pass (middle). Interestingly, the entire multi-scale
predictor is still a feed-forward architecture that is no longer
chain-structured, but a directed-acyclic graph (DAG) (right). We
show that DAG-CNNs can be discriminatively trained in an end-to-end
fashion, yielding state-of-the-art recognition results across various
recognition benchmarks.
Figure 2. Retrieval results using L2 distance for both mid- and high-level features on MIT67 [24], computed from layers 11 and 20 of the
Caffe model. A green (red) box marks a correct (wrong) result in terms of the scene category label. The correct labels for wrong retrievals
are provided. Retrieved images are ordered by distance to the query, with the closest match left-most. Certain
query images (or categories) produce better matches with high-level features, while others produce better results with mid-level features.
This motivates our multi-scale approach.


Multi-scale representations: Multi-scale representations
are a classic concept in computer vision, dating back
to image pyramids [4], scale-space theory [20], and
multiresolution models [21]. Though somewhat fundamental
notions, they have not been tightly integrated with
contemporary feed-forward approaches for recognition. We
introduce multi-scale CNN architectures that use features at
multiple scales for output prediction (Fig. 1). From one
perspective, our architectures are quite simple. Typical
approaches train an output predictor (e.g., a linear SVM)
using features extracted from a single output layer.
Instead, one can train an output predictor using features
extracted from multiple layers. Note that these features
come “for free”; they are already computed in a standard
feed-forward pass.

Spatial pooling: One difficulty with multi-scale
approaches is feature dimensionality: the total number of
features across all layers can easily reach hundreds of
thousands. This makes training even linear models difficult
and prone to over-fitting. Instead, we use marginal
activations computed from sum (or max) pooling across
spatial locations in a given activation layer. From this
perspective, our models are similar to those that compute
multi-scale features with spatial pooling, including
multi-scale templates [10], orderless models [12], spatial
pyramids [17], and bag-of-words [31]. Our approach is most
related to [12], who also use spatially pooled CNN features
for scene classification. They do so by pooling together
multiple CNN descriptors (re)computed on various-sized
patches within an image. Instead, we perform a single CNN
encoding of the entire image, extracting multiscale features
“for free”.

End-to-end training: Our multi-scale model differs
from such past work in another notable aspect. Our entire
model is still a feed-forward CNN that is no longer
chain-structured, but a directed-acyclic graph (DAG).
DAG-structured CNNs can still be discriminatively trained in
an end-to-end fashion, allowing us to directly learn
multi-scale representations. DAG structures are relatively
straightforward to implement given the flexibility of many
deep learning toolboxes [34, 14]. Our primary contribution
is the demonstration that such structures can capture
multiscale features, which in turn allow for transfer
learning between coarse and fine-grained classification
tasks.

DAG Neural Networks: DAG-structured neural nets
were explored earlier in the context of recurrent neural
nets [1, 13]. Recurrent neural nets use feedback to capture
dynamic states, and so typically cannot be processed
with feed-forward computations. More recently, networks
have explored the use of “skip” connections between layers
[25, 33, 28], similar to our multi-scale connections. [25]
show that such connections are useful for a single binary
classification task, but we motivate multiscale connections
through multitask learning: different visual classification
tasks require features at different image scales. Finally, the
recent state-of-the-art model of [33] makes use of skip
connections for training, but does not use them at test time.
This means that their final feed-forward predictor is not a
DAG. Our results suggest that adding multiscale connections
at test time might further improve their performance.

Overview: We motivate our multi-scale DAG-CNN
model in Sec. 2, describe the full architecture in Sec. 3,
and conclude with numerous benchmark results in Sec. 4.
We evaluate multi-scale DAG-structured variants of existing
CNN architectures (e.g., Caffe [14], Deep19 [29])
on a variety of scene recognition benchmarks including
SUN397 [38], MIT67 [24], and Scene15 [9]. We observe a
consistent improvement regardless of the underlying CNN
architecture, producing state-of-the-art results on all 3
datasets.

2. Motivation 

In this section, we motivate our multi-scale architecture
with a series of empirical analyses. We carry out an
analysis on existing CNN architectures, namely Caffe and
Deep19. Caffe [14] is a broadly used CNN toolbox. It
includes a pre-trained “AlexNet” [16] model, learned
with millions of images from the ImageNet dataset [5]. This
model has 6 conv. layers and 2 fully-connected (FC) layers.
Deep19 [29] uses very small 3×3 receptive fields, but an
increased number of layers: 19 layers (16 conv. and 3 FC
layers). This model produced state-of-the-art performance
in the ILSVRC-2014 classification challenge [27]. We
evaluate both “off-the-shelf” pre-trained models on the
heavily benchmarked MIT Indoor Scene (MIT67) dataset [24],
using 10-fold cross-validation.

2.1. Single-scale models 

Image retrieval: Recent work has explored sparse
reconstruction techniques for visualizing and analyzing
features [35]. Inspired by such techniques, we use image
retrieval to begin our exploration. We attempt to “reconstruct”
a query image by finding the M = 7 closest images in terms
of L2 distance, computed with mean-pooled layer-specific
activations. Results are shown for two query images and two
Caffe layers in Fig. 2. The florist query image tends to
produce better matches when using mid-level features that
appear to capture objects and parts. On the other hand, the
church-inside query image tends to produce better matches
when using high-level features that appear to capture more
global scene statistics.
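
The retrieval step described above can be sketched in a few lines. This is a minimal illustration rather than the code used for Fig. 2: the function names and the toy feature values are assumptions, and each feature stands in for a mean-pooled activation vector from one layer.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query, gallery, m=7):
    """Return the indices of the m gallery features closest to the query."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: l2_distance(query, gallery[i]))
    return ranked[:m]

# Toy gallery of four 3-dimensional pooled features (made-up values).
gallery = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.9, 0.1, 0.0],
           [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]
nearest = retrieve(query, gallery, m=2)  # closest match first
```

Running the same query against features from different layers would return different neighbor lists, which is the comparison shown in Fig. 2.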

Single-scale classification: Following past work [26],
we train a linear SVM classifier using features extracted
from a particular layer. We specifically train K = 67
one-vs-all linear classifiers. We plot the performance of
single-layer classifiers in Fig. 3. The detailed parameter
options for both models are described later in Sec. 4. As
past work has pointed out, we see a general increase in
performance as we use higher-level (more invariant)
features. We do see a slight improvement at each nonlinear
activation (ReLU) layer. This makes sense as this layer
introduces a nonlinear rectification operation max(0, x),
while other layers (such as convolutional or sum-pooling)
are linear operations that can be learned by a linear
predictor.

Figure 3. The classification accuracy on MIT67 [24] using
activations from each layer. A solid color fill marks the
output of a ReLU layer; there are 7 in total for the Caffe
model. We tend to see a performance jump at each successive
ReLU layer, particularly earlier on in the model.

Scale-varying classification: The above experiment
required training K × N one-vs-all classifiers, where K is the
number of classes and N is the number of layers. We can
treat each of the KN classifiers as binary predictors, and
score each with the number of correct detections of its
target class. We plot these scores as a matrix in Fig. 4. We
tend to see groups of classes that operate best with features
computed from particular high-level or mid-level layers. Most
categories tend to do well with high-level features, but a
significant fraction (over a third) do better with mid-level
features.
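
As a small illustration of this analysis, the sketch below groups classes by their best-scoring layer, given a K × N score matrix. The matrix values and the helper name are made up for the example; only the layer labels echo those in Fig. 4.

```python
from collections import defaultdict

def group_by_best_layer(scores, layer_names):
    """scores[k][n] is the number of correct detections of class k when
    its one-vs-all classifier uses features from layer n."""
    groups = defaultdict(list)
    for k, row in enumerate(scores):
        # Ties are broken in favor of the earlier layer in the list.
        best = max(range(len(row)), key=lambda n: row[n])
        groups[layer_names[best]].append(k)
    return dict(groups)

layer_names = ["Layer-11", "Layer-18", "Layer-20"]
scores = [[12, 15, 14],   # class 0 is best served by Layer-18
          [9, 7, 8],      # class 1 is best served by Layer-11
          [5, 6, 10]]     # class 2 is best served by Layer-20
groups = group_by_best_layer(scores, layer_names)
```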

Spatial pooling: In the next section, we will explore
multi-scale features. One practical hurdle to including all
features from all layers is the massive increase in
dimensionality. Here, we explore strategies for reducing
dimensionality through pooled features. We consider various
pooling strategies (sum, average, and max), pooling
neighborhoods, and normalization post-processing (with
and without L2 normalization). We saw good results with
average pooling over all spatial locations, followed by L2
normalization. Specifically, assume a particular layer is of
size H × W × F, where H is the height, W is the width,
and F is the number of filter channels. We compute a
1 × 1 × F feature by averaging across spatial dimensions. We
then normalize this feature to have unit norm. We compare
this encoding versus the unpooled full-dimensional feature
(also normalized) in Fig. 5. Pooled features always do
better, implying dimensionality reduction actually helps
performance. We verified that this phenomenon was due to
over-fitting; the full features always performed better on
training data, but performed worse on test data. This
suggests that with additional training data, less-aggressive
pooling (that preserves some spatial information) may
perform better.

Figure 4 (panels: Layer-11, Layer-13, Layer-15, Layer-18, Layer-20). The “per-class” performance using features extracted from particular layers. We group classes with identical best-performing
layers. The last layer (20) is optimal for only 15 classes, while the second-to-last layer (18) proves most discriminative for 26 classes. The
third-most effective layer (11) captures significantly lower-level features. These results validate our underlying hypothesis; different
classes require different amounts of invariance. This suggests that a feature extractor shared across such classes will be more effective
when multi-scale.

Figure 5. Marginal features (computed by spatially pooling across
activations from a layer) do as well or better than their
full-dimensional counterparts.
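
The pooled encoding described above (average over all spatial locations, then scale to unit norm) can be sketched as follows. This is an illustrative pure-Python version operating on nested lists; the function names and the small `eps` guard against division by zero are assumptions, not details from our implementation.

```python
import math

def average_pool(activations):
    """Average an H x W x F activation (nested lists) over all H*W
    spatial locations, giving one value per filter channel."""
    h, w = len(activations), len(activations[0])
    f = len(activations[0][0])
    pooled = [0.0] * f
    for row in activations:
        for cell in row:
            for c in range(f):
                pooled[c] += cell[c]
    return [v / (h * w) for v in pooled]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

# Toy 2 x 2 x 2 activation: channel 0 averages to 3, channel 1 to 4.
act = [[[3.0, 0.0], [3.0, 0.0]],
       [[3.0, 8.0], [3.0, 8.0]]]
feature = l2_normalize(average_pool(act))  # approximately [0.6, 0.8]
```

The output has F entries regardless of H and W, which is what keeps the multi-scale feature small enough to train on.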

2.2. Multi-scale models 

Multiscale classification: We now explore multi-scale
predictors that process pooled features extracted from
multiple layers. As before, we analyze “off-the-shelf”
pre-trained models. We evaluate performance as we iteratively
add more layers. Fig. 3 suggests that the last ReLU layer
is a good starting point due to its strong single-scale
performance. Fig. 6(a) plots performance as we add previous
layers to the classifier feature set. Performance increases as
we add intermediate layers, while lower layers prove less
helpful (and may even hurt performance, likely due to
over-fitting). Our observations suggest that high and mid-level
features (i.e., parts and objects) are more useful than
low-level features based on edges or textures in scene
classification.

Multi-scale selection: The previous results show that
adding all layers may actually hurt performance. We
verified that this was an over-fitting phenomenon; additional
layers always improved training performance, but could
decrease test performance due to over-fitting. This appears
especially true for multi-scale analysis, where nearby layers
may encode redundant or correlated information (that is
susceptible to over-fitting). Ideally, we would like to search
for the “optimal” combination of ReLU layers that maximizes
performance on validation data. Since there exists
an exponential number of combinations ($2^N$ for N ReLU
layers), we find an approximate solution with a greedy
forward-selection strategy. We greedily select the next-best
layer (among all remaining layers) to add, until we observe
no further performance improvement. As seen in Fig. 6(b),
the optimal results of this greedy approach reject the
low-level features. This is congruent with the previous
results in Fig. 6(a).

Figure 6 (panels: (a) Multi-scale, (b) Forward Selection). (a) The performance of a multi-scale classifier as we add
more layer-specific features. We start with the last ReLU layer,
and iteratively add the previous ReLU layer to the feature set of
the classifier. The “+” sign marks the most recently added layer.
Adding additional layers helps, but performance saturates and even
slightly decreases when adding lower-layer features. This
suggests it may be helpful to search for the “optimal” combination of
layers. (b) The performance trend when using forward selection
to incorporate the ReLU layers. Note that layers are not selected
in high-to-low order. Specifically, it begins with the second-to-last
ReLU layer, and skips one or more previous layers when adding
the next layer. This suggests that layers encode some redundant or
correlated information. Overall, we see a significant 6%
improvement.
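
The greedy forward-selection strategy can be sketched as below. The `evaluate` callback is hypothetical: in practice it would train a classifier on the pooled features of the given layers and return validation accuracy. The toy scoring function and its numbers are invented purely so the example runs.

```python
def forward_select(candidates, evaluate):
    """Greedily add the layer that most improves validation accuracy,
    stopping when no remaining layer improves the score."""
    selected, best_score = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        score, layer = max(
            (evaluate(tuple(selected + [l])), l) for l in remaining
        )
        if score <= best_score:
            break  # no further improvement
        selected.append(layer)
        remaining.remove(layer)
        best_score = score
    return selected, best_score

def toy_evaluate(subset):
    # Made-up behavior: layers 18 and 13 are complementary, layer 20 is
    # redundant given layer 18, and low-level layer 2 slightly hurts.
    single = {18: 0.60, 20: 0.58, 13: 0.50, 2: 0.20}
    base = max(single[l] for l in subset)
    bonus = 0.05 if {18, 13} <= set(subset) else 0.0
    penalty = 0.02 if 2 in subset else 0.0
    return base + bonus - penalty

selected, best = forward_select([2, 13, 18, 20], toy_evaluate)
```

With N candidate layers this needs at most N(N+1)/2 evaluations rather than $2^N$, at the cost of possibly missing the best subset.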

Our analysis strongly suggests the importance (and ease)
of incorporating multi-scale features for classification tasks.
For our subsequent experiments, we use scales selected by
the forward selection algorithm on MIT67 data (shown in
Fig. 6(b)). Note that we use them for all our experimental
benchmarks, demonstrating a degree of cross-dataset
generalization in our approach.

3. Approach 

In this section, we show that the multi-scale model
examined in Fig. 6(b) can be written as a DAG-structured,
feed-forward CNN. Importantly, this allows for end-to-end
gradient-based learning. To do so, we use standard calculus
constructions, specifically the chain rule and partial
derivatives, to generalize back-propagation to layers that
have multiple “parents” or inputs. Though such DAG structures
have been introduced in prior work, we have not seen
derivations for the corresponding gradient computations. We
include them here for completeness, pointing out several
opportunities for speedups given our particular structure.

Model: The run-time behavior of our multi-scale predictor
from the previous section is equivalent to feed-forward
processing of the DAG-structured architecture in Fig. 6(b).
Note that we have swapped out a margin-based hinge-loss
(corresponding to an SVM) with a softmax function, as
the latter is more amenable to training with current
toolboxes. Specifically, typical CNNs are grouped into
collections of four layers, i.e., Conv., ReLU, contrast
normalization (Norm), and pooling layers (with the Norm and
pooling layers being optional). The final layer is usually a
K-way softmax function that predicts one of K outputs. We
visualize these layers as a chain-structured “backbone” in
Fig. 7. Our DAG-CNN simply links each ReLU layer to an
average-pooling layer, followed by an L2 normalization layer,
which feeds to a fully-connected (FC) layer that produces K
outputs (represented formally as a 1 × 1 × K matrix). These
outputs are element-wise added together across all layers,
and the resulting K numbers are fed into the final softmax
function. The weights of the FC layers are equivalent to the
weights of the final multi-scale K-way predictor (which is a
softmax predictor for a softmax loss output, and an SVM for
a hinge-loss output). Note that all the required operations
are standard modules except for the Add.
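
The prediction path just described (per-layer FC scores, element-wise Add, softmax) can be sketched as follows. This is a toy illustration, not our MatConvNet implementation: it assumes the per-layer features have already been average-pooled and L2-normalized, and the names `fc`, `dag_predict`, and the identity weights are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of K scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fc(feature, weights, bias):
    """Fully-connected layer: one row of weights per output class."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]

def dag_predict(features, heads):
    """features: one pooled, normalized vector per selected ReLU layer.
    heads: matching (weights, bias) pairs, each producing K scores."""
    k = len(heads[0][1])
    total = [0.0] * k
    for feat, (weights, bias) in zip(features, heads):
        scores = fc(feat, weights, bias)
        total = [t + s for t, s in zip(total, scores)]  # the Add layer
    return softmax(total)

# Two scales, K = 2 classes, identity FC weights for readability.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
features = [[1.0, 0.0], [0.6, 0.8]]
probs = dag_predict(features, [identity, identity])
```

Summing per-layer scores is equivalent to applying one linear predictor to the concatenated multi-scale feature, which is why the FC weights coincide with the weights of the final K-way predictor.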

Training: Let $w_1, \dots, w_K$ be the CNN model parameters
at the $1, \dots, K$-th layers, and let the training data be
$(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is the $i$-th input image
and $y^{(i)}$ is the indicator vector of the class of $x^{(i)}$.
Then we intend to solve the following optimization problem

$$\operatorname*{argmin}_{w_1, \dots, w_K} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x^{(i)}; w_1, \dots, w_K),\; y^{(i)}\big) \qquad (1)$$

As is now commonplace, we make use of stochastic gradient
descent to minimize the objective function. For a
traditional chain model, the partial derivative of the output
with respect to any one weight can be recursively computed
by the chain rule, as described in the back-prop algorithm.

Figure 8. Visualization of the parameter setup at the $i$-th ReLU.

Multi-output layers (ReLU): Our DAG-model is structurally
different at the ReLU layers (since they have multiple
outputs) and the Add layer (since it has multiple inputs).
We can still compute partial derivatives by recursively
applying the chain rule, but care needs to be taken at these
points. Let us consider the $i$-th ReLU layer in Fig. 8. Let
$\alpha_i$ be its input, $\beta_i^{(j)}$ be the output for its
$j$-th output branch (its child in the DAG), and let $z$ be the
final output of the softmax layer. The gradient of $z$ with
respect to the input of the $i$-th ReLU layer can be computed as

$$\frac{\partial z}{\partial \alpha_i} = \sum_{j=1}^{C} \frac{\partial z}{\partial \beta_i^{(j)}} \frac{\partial \beta_i^{(j)}}{\partial \alpha_i} \qquad \text{(in general)} \qquad (2)$$

where C = 2 for the example in Fig. 8. One can recover
standard back-propagation equations from the above by
setting C = 1: a single back-prop signal
$\partial z / \partial \beta_i^{(1)}$ arrives at ReLU unit $i$,
is multiplied by the local gradient
$\partial \beta_i^{(1)} / \partial \alpha_i$, and is passed
on down to the next layer below. In our DAG, multiple
back-prop signals $\partial z / \partial \beta_i^{(j)}$ arrive
from each branch $j$, each is multiplied by a branch-specific
gradient $\partial \beta_i^{(j)} / \partial \alpha_i$, and their
total sum is passed on down to the next layer.

Multi-input layers (Add): Let
$\beta_k = g(\alpha_k^{(1)}, \dots, \alpha_k^{(C)})$ represent
the output of a layer with multiple inputs. We can
compute the gradient along the layer by applying the chain
rule as follows:

$$\frac{\partial z}{\partial \alpha_i} = \frac{\partial z}{\partial \beta_k} \frac{\partial \beta_k}{\partial \alpha_i} = \frac{\partial z}{\partial \beta_k} \sum_{j=1}^{C} \frac{\partial \beta_k}{\partial \alpha_k^{(j)}} \frac{\partial \alpha_k^{(j)}}{\partial \alpha_i} \qquad \text{(in general)} \qquad (3)$$

One can similarly arrive at the standard back-propagation
by setting C = 1.
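
To make the multi-output and multi-input rules concrete, here is a minimal scalar sketch of a ReLU unit with C = 2 children feeding an Add layer. It is a toy network invented for illustration (the weights, the quadratic stand-in for the output $z$, and the function names are all assumptions); the analytic gradient can be checked against a finite-difference estimate.

```python
def relu(a):
    return a if a > 0 else 0.0

def forward(a, w_chain, w_skip):
    beta = relu(a)            # ReLU output, sent to both children
    chain = w_chain * beta    # branch 1: continues along the chain
    skip = w_skip * beta      # branch 2: multi-scale (skip) branch
    s = chain + skip          # Add layer with two inputs
    return 0.5 * s * s        # stand-in scalar output z

def backward(a, w_chain, w_skip):
    """Gradient of z with respect to the ReLU input a."""
    beta = relu(a)
    s = w_chain * beta + w_skip * beta
    dz_ds = s
    # Add layer: the local gradient for each input is 1, so the
    # incoming signal is simply replicated to both parents.
    dz_dchain, dz_dskip = dz_ds, dz_ds
    # Back-prop signals arriving at the ReLU from each branch j.
    signals = [dz_dchain * w_chain, dz_dskip * w_skip]
    # Multi-output rule: multiply each signal by the local ReLU
    # gradient and sum over branches.
    local = 1.0 if a > 0 else 0.0
    return sum(sig * local for sig in signals)

a, w_chain, w_skip = 0.5, 2.0, 3.0
analytic = backward(a, w_chain, w_skip)
eps = 1e-6
numeric = (forward(a + eps, w_chain, w_skip)
           - forward(a - eps, w_chain, w_skip)) / (2 * eps)
```

For these values $z = 12.5\,a^2$, so both estimates should agree on a gradient of 12.5 at $a = 0.5$.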

Special case (ReLU): Our particular DAG architecture
can further simplify the above equations. Firstly, it may
be common for multiple-output layers to duplicate the same
output for each child branch. This is true of our ReLU units;
they pass the same values to the next layer in the chain and
the current-layer pooling operation. This means the
output-specific gradients are identical for those outputs,
$\forall j,\; \partial \beta_i^{(j)} / \partial \alpha_i = \partial \beta_i^{(1)} / \partial \alpha_i$,
which simplifies (2) to

$$\frac{\partial z}{\partial \alpha_i} = \frac{\partial \beta_i^{(1)}}{\partial \alpha_i} \sum_{j=1}^{C} \frac{\partial z}{\partial \beta_i^{(j)}} \qquad \text{(for duplicate outputs)} \qquad (4)$$

This allows us to add together multiple back-prop signals
before scaling them by the local gradient, reducing the
number of multiplications by C. We make use of this speedup
to train our ReLU layers.

Special case (Add): Similarly, our multi-input Add layer
reuses the same partial gradient for each input,
$\forall j,\; \partial \beta_k / \partial \alpha_k^{(j)} = \partial \beta_k / \partial \alpha_k^{(1)}$,
which simplifies even further in our case to 1. The
resulting back-prop equations that simplify (3) are given by

$$\frac{\partial z}{\partial \alpha_i} = \frac{\partial z}{\partial \beta_k} \frac{\partial \beta_k}{\partial \alpha_k^{(1)}} \sum_{j=1}^{C} \frac{\partial \alpha_k^{(j)}}{\partial \alpha_i} \qquad \text{(for duplicate gradients)} \qquad (5)$$

implying that one can similarly save C multiplications.
The above equations have an intuitive implementation; the
standard chain-structured back-propagation signal is simply
replicated along each of the parents of the Add layer.

Figure 7. Our multi-scale DAG-CNN architecture is constructed by adding multi-scale output connections to an underlying chain backbone
from the original CNN. Specifically, for each scale, we spatially (average) pool activations, normalize them to have unit norm, compute
an inner product with a fully-connected (FC) layer with K outputs, and add the scores across all layers to produce predictions for K output classes
(that are finally soft-maxed together).

Implementation: We use the excellent MatConvNet
codebase to implement our modifications [34]. We
implemented a custom Add layer and a custom DAG
data-structure to denote layer connectivity. Training and
testing are essentially as fast as the chain model.

Vanishing gradients: We point out an interesting property
of our multiscale models that makes them easier to train.
Vanishing gradients [2] refers to the phenomenon that
gradient magnitudes decrease as they are propagated through
layers, implying that lower layers can be difficult to learn
because they receive too small a learning signal. In our
DAG-CNNs, lower layers are directly connected to the output
layer through multi-scale connections, ensuring they receive
a strong gradient signal during learning. Fig. 9
experimentally verifies this claim.

Figure 9. The average gradient from Layer-1 (Conv.) during
training, plotted in log-scale. Gradients from the DAG are consistently
10× larger, implying that they receive a stronger supervised signal
from the target label during gradient-based learning.


4. Experimental Results

We explore DAG-structured variants of two popular deep
models, Caffe [14] and Deep19 [29]. We refer to these
models as Caffe-DAG and Deep19-DAG. We evaluate these
models on three benchmark scene datasets: SUN397 [38],
MIT67 [24], and Scene15 [9]. In absolute terms, we achieve
the best performance ever reported on all three benchmarks,
sometimes by a significant margin.

Feature dimensionality: Most existing methods that
use CNNs as feature extractors work with the last layer
(or the last fully connected layer), yielding a feature
vector of size 4096. Forward feature selection on Caffe-DAG
selects layers (11, 13, 15, 18, 20), making the final
multiscale feature 9216-dimensional. Deep19-DAG selects
layers (26, 28, 31, 33, 30), for a final size of 6144. We
perform feature selection by cross-validating on MIT67, and
use the same multiscale structure for all other datasets.
Dataset-dependent feature selection may further improve
performance. Our final multiscale DAG features are only
about 2× larger than their single-scale counterparts, making
them practically easy to use and store.

SUN397
Approach                     Accuracy (%)
Deep19-DAG                   56.2
Deep19 [29]                  51.9
Caffe-DAG                    46.6
Caffe [14]                   43.5
Places [39]                  54.3
MOP-CNN [12]                 52.0
FV [32]                      47.2
DeCaf [ ]                    40.9
Baseline-overfeat [37]       40.3
Baseline [38]                38.0

MIT67
Approach                     Accuracy (%)
Deep19-DAG                   77.5
Deep19 [29]                  70.8
Caffe-DAG                    64.6
Caffe [14]                   59.5
MOP-CNN [12]                 68.9
Places [39]                  68.2
Mid-level [6]                64.0
FV+BoP [15]                  63.2
Disc. Patch [30]             49.4
SPM [17]                     34.4

Scene15
Approach                     Accuracy (%)
Deep19-DAG                   92.9
Deep19 [29]                  90.8
Caffe-DAG                    89.7
Caffe [14]                   86.8
Places [39]                  91.6
CENTRIST [36]                84.8
Hybrid [3]                   83.7
Spatial pyramid [17]         81.4
Object bank [19]             80.9
Reconfigurable model [23]    78.6
Spatial Envelope [22]        74.1
Baseline [9]                 65.2

Table 1. Classification results on SUN397, MIT67, and Scene15 datasets. Please see text for an additional discussion.

Training: We follow the standard image pre-processing
steps of fixing the input image size to 224 × 224 by
scaling and cropping, and subtracting out the mean RGB value
(computed on ImageNet). We initialize filters and biases
to their pre-trained values (tuned on ImageNet) and
initialize multi-scale fully-connected (FC) weights to small
normally-distributed numbers. We perform 10 epochs of
learning.
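
One common way to realize the “scaling and cropping” step is to scale the shorter image side to 224 pixels and take a centered crop. The sketch below computes that geometry and a per-pixel mean subtraction. The exact scheme (shorter-side scaling, center crop) and the mean values are assumptions for illustration; the text above does not specify them.

```python
def resize_and_crop_box(width, height, size=224):
    """Scale the shorter side to `size`, then take a centered
    size x size crop. Returns the resized (width, height) and the
    crop box as (left, top, right, bottom)."""
    scale = size / min(width, height)
    new_w = max(size, round(width * scale))
    new_h = max(size, round(height * scale))
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)

def subtract_mean(pixel, mean_rgb):
    """Subtract a per-channel mean from one RGB pixel."""
    return tuple(p - m for p, m in zip(pixel, mean_rgb))

resized, box = resize_and_crop_box(640, 480)
# Placeholder mean; the real value would be computed on ImageNet.
centered = subtract_mean((120, 110, 100), (100.0, 100.0, 100.0))
```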

Baselines: We compare our DAG models to published
results, including two additional baselines. We evaluate the
best single-scale “off-the-shelf” model, using both Caffe
and Deep19. We pass L2-normalized single-scale features
to Liblinear [8] to train K-way one-vs-all classifiers with
default settings. Finally, Sec. 4.1 concludes with a detailed
diagnostic analysis comparing off-the-shelf and fine-tuned
versions of chain and DAG structures.

SUN397: We tabulate results for all our benchmark
datasets in Table 1, and discuss each in turn. SUN397 [38]
is a large scene recognition dataset with 100K images
spanning 397 categories, provided with standard train-test
splits. Our DAG models outperform their single-scale
counterparts. In particular, Deep19-DAG achieves the highest
accuracy, 56.2%. Our results are particularly impressive
given that the next-best method of [39] (with a score of
54.3) makes use of an ImageNet-trained CNN and a
custom-trained CNN using a new 7-million image dataset with
400 scene categories.

MIT67: MIT67 consists of 15K images spanning 67 indoor
scene classes [24], provided with standard train/test
splits. Indoor scenes are interesting for our analysis
because some scenes are well characterized by high-level
spatial geometry (e.g., church and cloister), while others
are better described by mid-level objects (e.g., wine
cellar and operating room) in various spatial
configurations. We show qualitative results in Fig. 10.
Deep19-DAG produces a classification accuracy of 77.5%,
reducing the best previously reported error [12] by 23.9%.
Interestingly, [12] also uses multi-scale CNN features, but
does so by first extracting various-sized patches from an
image and rescaling each to a canonical size. Single-scale
CNN features extracted from these patches are then
vector-quantized into a large-vocabulary codebook, followed
by a projection step to reduce dimensionality. Our
multi-scale representation, while similar in spirit, is an
end-to-end trainable model that is computed “for free” from
a single (DAG) CNN.

Scene15: The Scene15 dataset [9] includes both indoor scenes
(e.g., store and kitchen) and outdoor scenes (e.g., mountain
and street). It is a relatively small dataset by
contemporary standards (2985 test images), but we include it
here for completeness. Performance is consistent with the
results above. Our multi-scale DAG model, specifically
Deep19-DAG, outperforms all prior work. For reference, the
next-best method of [39] uses a new custom 7-million image
scene dataset for training.

4.1. Diagnostics 

In this section, we analyze “off-the-shelf” (OTS) and 
“fine-tuned” (FT) versions of both single-scale chain and 
multi-scale DAG models. We focus on the Caffe model, as 
it is faster and easier for diagnostic analysis. 

Chain: Chain-OTS uses single-scale features extracted
from CNNs pre-trained on ImageNet. These are the baseline
Caffe results presented in the previous subsections.
Chain-FT trains a model on the target dataset, using the
pre-trained model as an initialization. This can be done with
standard software packages [34]. To ensure consistency of
analysis, in both cases features are passed to a K-way
multi-class SVM to learn the final predictor.

DAG: DAG-OTS is obtained by fixing all internal filters
and biases to their pre-trained values, and only learning the
multiscale fully-connected (FC) weights. Because this
final-stage learning is a convex problem, this can be done by
simply passing off-the-shelf multi-scale features to a
convex linear classification package (e.g., SVM). We compare
this model to a fine-tuned version that is trained end-to-end,
making use of the modified backprop equations from Sec. 3.

Figure 10. Deep19-DAG results on MIT67. The category label is shown on the left, and the labels for false positives (in red) are also
provided. We use the multi-scale analysis of Fig. 4 to compare categories that perform better with mid-level features (top 2 rows) versus
high-level features (bottom 2 rows). Mid-level features appear to emphasize objects such as operating equipment for operating room
scenes and circular grids for wine cellar, while high-level features appear to focus on global spatial statistics for inside bus and
elevator scenes.

Comparison: Fig. 11 compares off-the-shelf and
fine-tuned variants of chain and DAG models. We see two
dominant trends. First, as perhaps expected, fine-tuned (FT)
models consistently outperform their off-the-shelf (OTS)
counterparts. Even more striking is the large improvement
from chain to DAG models, indicating the power of
multi-scale feature encodings.

DAG-OTS: Perhaps most impressive is the strong
performance of DAG-OTS. From a theoretical perspective, this
validates our underlying hypothesis that multi-scale
features allow for better transfer between recognition
tasks, in this case ImageNet and scene classification. An
interesting question is whether multi-scale features, when
trained with gradient-based DAG-learning on ImageNet, will
allow for even more transfer. We are currently exploring this.
However, even with current CNN architectures, our results
suggest that any system making use of off-the-shelf CNN
features should explore multi-scale variants as a “cheap”
baseline. Compared to their single-scale counterparts,
multiscale features require no additional time to compute, are
only a factor of 2 larger to store, and consistently provide a
noticeable improvement.
Figure 11. Off-the-shelf vs. fine-tuned models for both Chain
and DAG structures with the Caffe backbone. Please see the text for a
discussion.


Conclusion: We have introduced multi-scale CNNs for
image classification. Such models encode scale-specific
features that can be effectively shared across both coarse
and fine-grained classification tasks. Importantly, such
models can be viewed as DAG-structured feed-forward
predictors, allowing for end-to-end training. While fine-tuning
helps performance, we empirically demonstrate that even
“off-the-shelf” multiscale features perform quite well. We
present extensive analysis and demonstrate state-of-the-art
classification performance on three standard scene
benchmarks, sometimes improving upon prior art by a significant
margin.